1. INTRODUCTION 

Social networks connect us with other people, sharing aspects of our life. Communication via online 
social networks has become a part of one’s life in this digital age. Social networks, especially Facebook, are 
widely used in the daily lives of people and have been growing rapidly. In Thailand, Facebook is the most 
popular social media network and reached around 50.75 million users in 2022, despite slower growth forecasted 
to reach 45 million by 2026 [1]. Due to the numerous users and sharing nature of social networks, deceivers 
may easily spread messages with malicious intentions to other users (victims). Therefore, detecting and acting 
upon deceptive messages within social networks is of rising importance to mitigate the number of victims of 
deception and related online fraud. Various approaches have been proposed in the literature to detect online 
deception in messages. They can be grouped into two categories. Machine learning methods are the first 
category exploited to detect deception in social media communications. Briscoe et al. [2] classified deception 
by machine learning methods—random forest, gradient boosting, support vector machine (SVM), and 
Perceptron—on features, including sentence length, sentence complexity, sentiment, emoticon usage, and 
informality. Appling et al. [3] proposed random forest and SVM to evaluate deception strategies by using textual 
cues. Ott et al. [4] investigated deceptive opinion spam on hotel reviews with several extracted features, i.e., 
psychological deception, n-gram-based text categorization, and a combination of both n-gram and 
psychological deception. Naive Bayes and SVM then classified the spam opinion inputs. Deceptive opinion 
spam detection was also investigated in [5] and [6]. Rayana and Akoglu [5] used the behavior and content of 
messages as features for classifying opinion spam with the SPEAGLE approach. The mislabeled instances 
were corrected to find deceptive opinion spam from message reviews [6]. Features including lexicon, part of 
speech, deep syntactic information of text, and psycholinguistic features, were extracted from the reviews and 
classified by Naive Bayes, SVM, and logistic regression classifiers. Moreover, machine learning methods were 
applied to detect spam messages which led to online deception [7], [8]. Bhat et al. [9] compared single models 
and ensemble models for spammer classification, including decision tree, Naive Bayes and K-nearest neighbors 
(KNN), Bagging, Boosting, and Stacking ensemble classifiers. Zheng et al. [10] and Gupta and Kaushal [11] 
detected spammers and non-spammers on social networks from message content and user behavior, using 
SVM, Naive Bayes, and decision tree classifiers. Abdulqader et al. [12] proposed ten theories and nine relevant 
constructs to create a model for detecting fake online reviews. The ten theories include self-presentational 
theory, four-factor theory, interpersonal deception theory, leakage theory, truth-default theory, reality 
monitoring theory, criteria-based content analysis, scientific content analysis, verifiability approach, and 
information manipulation theory. The constructs include specificity, quantity, non-immediacy, affect, 
uncertainty, informality, consistency, source credibility, and deviation in behavior. Verbal and non-verbal 
features were investigated to validate the proposed model, and non-verbal features were found to be more 
important than verbal features. 

Secondly, deep learning methods have been proposed to detect online deception. Jain et al. [13] 
evaluated multiple deep neural network (DNN) based approaches to detect deceptive reviews. The 
performances of these models were compared on multiple benchmark datasets. In addition, a multi-instance 
learning and hierarchical architecture handling variable length review texts were reported to have outperformed 
other machine learning methods. Anass et al. [14] reported the comparison between different neural network 
architectures and their effectiveness in the detection of deceptive opinion spam. Their results showed that 
convolutional neural network (CNN) performed better than recurrent neural network (RNN), long short-term 
memory (LSTM), bidirectional long short-term memory (BiLSTM), gated recurrent units (GRU), and 
bidirectional gated recurrent units (BiGRU). Zhang et al. [15] proposed a deep learning approach, called deep 
context representation by word vectors (DCWord), for text representation in deceptive review identification. 
The basic idea is that contextual information of words of deceptive reviews and truthful reviews should have 
different characteristics. The average-pooling technique is applied to the word vector-encoded data. 
Experimental results reported that the DCWord-M representation with logistic regression gave the highest 
accuracy for detecting deceptive reviews. Zhang et al. [16] proposed a deep learning method, called deceptive 
review identification by recurrent convolutional neural network (DRI-RCNN) to identify deceptive reviews. It 
used word contexts and a deep learning technique to detect deceptive reviews. Their experiment found that the 
DRI-RCNN outperformed SVM in deceptive review identification. Qureshi et al. [17] applied a feature set 
combining the connectivity patterns of news propagators with their profile features to detect COVID-19 fake 
news. Various machine learning and deep learning models were investigated for the detection, and 
CatBoost and RNN were found to be the most effective. 

These previous works suggest that deep learning outperforms classical machine learning in deceptive message 
identification, mainly because hand-engineered feature extraction (as used in classical machine learning 
techniques) does not capture the semantic information in the text data that is needed to discriminate the 
deceptive indicators [14]. Therefore, we apply deep learning models to detect deceptive Thai-language 
messages from Facebook sources. In addition, Thai-language messages are complex in ways that differ from 
English-language messages, and the techniques in this paper contribute to the much-needed work in this very 
challenging and under-explored specialization of deception detection. 

The rest of this paper is organized as follows: in section 2, the proposed method is explained. In section 3, 
we show the experimental results and give a discussion. Finally, the conclusion and direction of future work 
are given in section 4. 


2. METHOD 

The overall process of the method is shown in Figure 1 and consists of three main parts: data preparation, 
the proposed models, and evaluation. For the proposed models, the CNN, BiLSTM, and BiGRU models are 
proposed to classify deceptive messages. We also propose a set of hybrid models, i.e., CNN-BiLSTM and 
CNN-BiGRU, for detecting deceptive messages. 

Figure 1. The proposed method 


2.1. Data preparation 

The dataset was collected by extracting textual messages from Facebook pages relevant to job 
applications written in the Thai language. Truthful messages were collected from reliable pages. The deceptive 
messages were collected from unreliable pages. The details of the dataset collection process can be found in [18]. 
The dataset contains 1,189 truthful and 1,189 deceptive messages, with a mean message length 
of 926 characters. Since deep learning models cannot learn directly from raw text data, a procedure to transform 
input messages into feature vectors is required. First, the messages are cleaned by removing numbers, signs, and 
stop words. Then the remaining character sequence (message) is segmented into words using OSKut [19]. These 
word sequences are padded by prepending 0's to ensure an equal length of 534 words (the maximum 
length of a processed message within the dataset). Each message instance is thus a feature (or word) vector. 
In our experiments, two feature vector encodings are tested. The first is the one-hot encoding technique; the second 
is Thai2fit [20]. Thai2fit is a pre-trained Thai word embedding that was trained 
on Thai Wikipedia data with the ULMFiT method; each word is represented by a 300-dimension vector. 
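
The padding and one-hot steps described above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the helper names (`build_vocab`, `pre_pad`, `one_hot`), the toy tokens, and the choice to reserve index 0 for padding are assumptions, and word segmentation (done with OSKut in the paper) is taken as already completed.

```python
def build_vocab(tokenized_messages):
    """Map each distinct word to a positive integer; 0 is reserved for padding."""
    vocab = {}
    for tokens in tokenized_messages:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def pre_pad(indices, max_len, pad_value=0):
    """Prepend pad_value until the sequence has max_len entries."""
    if len(indices) > max_len:
        raise ValueError("sequence longer than max_len")
    return [pad_value] * (max_len - len(indices)) + list(indices)


def one_hot(index, vocab_size):
    """Return a one-hot vector of length vocab_size + 1 (slot 0 is padding)."""
    vec = [0] * (vocab_size + 1)
    vec[index] = 1
    return vec


# Hypothetical, already-segmented messages (placeholders, not dataset content).
messages = [["w1", "w2", "w3"], ["w2", "w4"]]
vocab = build_vocab(messages)
max_len = max(len(m) for m in messages)  # 534 for the dataset in the paper
padded = [pre_pad([vocab[t] for t in m], max_len) for m in messages]
encoded = [[one_hot(i, len(vocab)) for i in seq] for seq in padded]
```

Here the shorter message becomes `[0, 2, 4]` after padding, so every message yields a matrix with the same number of rows regardless of its original length.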


2.2. The proposed models 
2.2.1. CNN 

CNN has been widely used for text classification and has shown favorable results in various domains of text 
classification [21]. Fundamentally, it is a feed-forward neural network. It has a convolutional layer for generating 
feature maps. The size of each feature map is then reduced by a pooling layer. Finally, a softmax layer 
(one of several activation functions) is used as the classification output. The structure of the proposed CNN 
model for classifying deceptive messages is shown in Figure 2. An input feature vector is fed into the 
convolutional layer to learn information from words in sentences through the filters. In the proposed CNN 
model, the input feature vector is convolved using 32 filters, each of size 3x300. The output from the 
convolution of each filter is a feature map of size h-(3-1) when the stride is 1, where h is the number of words 
in the message. The feature maps are pooled by a 1D max pooling layer to generate a convolutional feature 
with a size of 32. The pooled output is flattened into a single column by the flatten layer. A dropout layer is 
applied for regularization to reduce overfitting in the model. Finally, the output layer gives the predicted class. 
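
To make the feature map size h-(3-1) and the pooled size concrete, the following sketch runs an unpadded 1D convolution and a max pooling step on toy data. The function names, the toy values, and the use of global max pooling (one value per filter, which is consistent with the stated pooled size of 32) are assumptions for illustration; bias terms and activation functions are omitted.

```python
def conv1d_valid(sequence, kernel, stride=1):
    """Slide a (k x d) kernel over an (h x d) sequence without padding."""
    k = len(kernel)
    out = []
    for start in range(0, len(sequence) - k + 1, stride):
        window = sequence[start:start + k]
        out.append(sum(w * x
                       for w_row, x_row in zip(kernel, window)
                       for w, x in zip(w_row, x_row)))
    return out


def global_max_pool(feature_maps):
    """Keep the maximum of each feature map: one value per filter."""
    return [max(fm) for fm in feature_maps]


# Toy sizes (assumptions): h = 6 words, d = 2 dimensions, 2 filters of size 3 x 2.
# The paper uses h = 534, d = 300, and 32 filters of size 3 x 300.
sequence = [[1, 0], [0, 1], [2, 1], [0, 0], [1, 3], [1, 1]]
filters = [
    [[1, 0], [1, 0], [1, 0]],  # sums the first dimension over 3 words
    [[0, 1], [0, 1], [0, 1]],  # sums the second dimension over 3 words
]
feature_maps = [conv1d_valid(sequence, f) for f in filters]
pooled = global_max_pool(feature_maps)
```

Each feature map has 6-(3-1) = 4 entries, and pooling leaves one value per filter. With the paper's sizes the same arithmetic gives feature maps of length 532 and a pooled vector of size 32.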

Figure 2. The structure of the proposed CNN model 


2.2.2. BiLSTM 

LSTM is a deep learning method for sequential data [22]. It is an algorithm in the RNN family. 
LSTM processes data in a forward direction with the ability to remember and forget information. An LSTM 
consists of a forget gate ($f_t$), an input gate ($i_t$), an input modulation gate ($\tilde{c}_t$), a cell state ($c_t$), 
an output gate ($o_t$), and a hidden state ($h_t$) [23]. The forget gate decides which information is kept to 
calculate the cell state and which information is forgotten. The current sample ($x_t$) and the previous hidden 
state ($h_{t-1}$) are fed through a sigmoid function in the forget gate, as expressed in (1), where $W$ is a weight 
matrix and $b$ is a bias vector. The input gate identifies the important information in the sample ($x_t$) and the 
previous hidden state ($h_{t-1}$), as expressed in (2). The input modulation gate is the candidate cell state. It learns 
from both the new information and the previous hidden state, as shown in (3). The cell state combines the old 
information that is filtered by the forget gate with the new information that is produced by the input gate and 
the modulation gate, as shown in (4). The output gate gives the next hidden state, as shown in (5). The hidden 
state holds the information which is seen by LSTM, as shown in (6). LSTM’s extension, called BiLSTM, was 
proposed to learn both past and future input data sequences, so data is processed in the forward direction and 
the backward direction in parallel [24]. For BiLSTM, the hidden states of both the forward and backward 
directions are saved. For text classification, BiLSTM views text as a sequence of words. A sample ($x_t$) is the 
word vector of a word in a sentence [25]. In this paper, the structure of the proposed BiLSTM model for 
classifying deceptive messages is shown in Figure 3. 


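$f_t = \mathrm{sigmoid}(W_f [h_{t-1}, x_t] + b_f)$ (1) 
$i_t = \mathrm{sigmoid}(W_i [h_{t-1}, x_t] + b_i)$ (2) 
$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$ (3) 
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ (4) 
$o_t = \mathrm{sigmoid}(W_o [h_{t-1}, x_t] + b_o)$ (5) 
$h_t = o_t \odot \tanh(c_t)$ (6) 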
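
A single LSTM time step following equations (1) to (6) can be sketched in plain Python as below. The function and parameter names (`lstm_step`, `params`, `W_f`, and so on) and the toy sizes are assumptions for illustration; a practical implementation would use a deep learning library with learned weights.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, v, b):
    """Compute W v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; comments give the matching equation number."""
    concat = h_prev + x_t  # the concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(v) for v in affine(params["W_f"], concat, params["b_f"])]  # (1)
    i_t = [sigmoid(v) for v in affine(params["W_i"], concat, params["b_i"])]  # (2)
    c_tilde = [math.tanh(v)
               for v in affine(params["W_c"], concat, params["b_c"])]         # (3)
    c_t = [f * c + i * g
           for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]                  # (4)
    o_t = [sigmoid(v) for v in affine(params["W_o"], concat, params["b_o"])]  # (5)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                        # (6)
    return h_t, c_t


# Toy sizes (assumptions): hidden size 2, input size 3, all-zero parameters.
hidden, inputs = 2, 3
zeros = [[0.0] * (hidden + inputs) for _ in range(hidden)]
params = {name: zeros for name in ("W_f", "W_i", "W_c", "W_o")}
params.update({name: [0.0] * hidden for name in ("b_f", "b_i", "b_c", "b_o")})

h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -1.0], params)
```

With all-zero parameters every gate evaluates to 0.5 and the candidate state to 0, so the new cell state is simply half of the previous one. A bidirectional layer would run this step once over the sequence in order and once in reverse, keeping both final hidden states.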

Figure 3. The structure of the proposed BiLSTM model 


2.2.3. BiGRU 

GRU is an algorithm of the RNN family [26]. It requires sequential data for learning. To decrease 
computation time, GRU reduces the number of gating signals used in LSTM, as shown in (7) and (8), where U is a weight 
matrix. It includes two gates: an update gate $z_t$, as shown in (9), and a reset gate $r_t$, as shown 
in (10). The model parameters (W, U, b) are shared at all time steps and learned during the training stage. In 
this paper, we propose BiGRU, which allows for the use of information from both previous time steps and later 
time steps to make predictions about the current state. The proposed BiGRU model in this paper is presented 
in Figure 4. The hidden states of the forward and backward directions are generated in the BiGRU layer and 
combined before going through the dropout layer. Finally, the output from the dropout layer goes to the output layer, 
which predicts the class. 
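
As a point of reference for the update and reset gates described above, a commonly used GRU formulation can be written as follows. This is the textbook form, given here as an assumption: the subscripted weight names, the ordering of the lines, and the convention for interpolating between $h_{t-1}$ and the candidate state $\tilde{h}_t$ may differ from the authors' own equations.

```latex
\begin{aligned}
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\right) \\
z_t &= \mathrm{sigmoid}\left(W_z x_t + U_z h_{t-1} + b_z\right) \\
r_t &= \mathrm{sigmoid}\left(W_r x_t + U_r h_{t-1} + b_r\right)
\end{aligned}
```

In this form the GRU has no separate cell state or output gate, which is where the reduction in computation relative to LSTM comes from.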

Figure 4. The structure of the proposed BiGRU model 


2.2.4. CNN-BiLSTM 

The proposed CNN-BiLSTM aims to learn the local features of the text input vector using CNN and 
then long-range dependency in the sequence of words is learned by BiLSTM. The proposed CNN-BiLSTM 
model is presented in Figure 5. The output from the convolutional layer is the feature map. Then the feature 
map is fed into the BiLSTM layer to learn the features in forward and backward directions. The hidden states 
of forward and backward directions are combined and fed through the dropout layer. Finally, the output from 
the dropout layer goes to the output layer and predicts the class. 
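
The data flow through the hybrid model can be traced with a small shape calculation. This is a sketch under stated assumptions: the number of recurrent units (64), the two-class output, and the concatenation of the forward and backward hidden states are illustrative choices that the text does not specify, and the 300-dimension input corresponds to the word embedding case.

```python
def conv1d_shape(shape, filters, kernel_size, stride=1):
    """Output shape of an unpadded 1D convolution over (steps, channels)."""
    steps, _ = shape
    return ((steps - kernel_size) // stride + 1, filters)


def bidirectional_shape(units):
    """Final forward and backward hidden states, assumed concatenated."""
    return (2 * units,)


def cnn_bilstm_shapes(steps=534, embed_dim=300, filters=32, kernel_size=3,
                      units=64, classes=2):
    """List the tensor shape after each stage of the hybrid model."""
    shapes = [("input", (steps, embed_dim))]
    conv = conv1d_shape(shapes[-1][1], filters, kernel_size)
    shapes.append(("conv1d", conv))
    rnn = bidirectional_shape(units)
    shapes.append(("bilstm", rnn))
    shapes.append(("dropout", rnn))  # dropout does not change the shape
    shapes.append(("output", (classes,)))
    return shapes


stages = cnn_bilstm_shapes()
```

With these values the convolution turns the 534 x 300 input into a 532 x 32 feature map, which the recurrent layer then reads as a sequence of 532 steps with 32 features each.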

Figure 5. The structure of the proposed CNN-BiLSTM model 


2.2.5. CNN-BiGRU 

The proposed CNN-BiGRU model learns the local features of text via CNN, and then the long-range 
dependency between the sequences of words is learned by BiGRU. The proposed CNN-BiGRU model is presented 
in Figure 6. The output from the convolutional layer is fed into the BiGRU layer to learn the features in the forward and 
backward directions. The hidden states of the forward and backward directions are combined and go through the dropout 
layer. Finally, the output from the dropout layer goes to the output layer, which predicts the class. 

Figure 6. The structure of the proposed CNN-BiGRU model 


2.3. Performance evaluation 

This section explains the details of the performance evaluation. All the proposed models are evaluated 
using accuracy, precision, recall, and F-measure. Accuracy (Acc) is the proportion of all predictions that are 
correct (and is a reliable measure on our balanced dataset). Precision is the proportion of the predictions 
for a class that are correct. Recall is the proportion of the true instances of a class that are predicted correctly. 
F-measure is the harmonic mean of precision and recall. Precision, recall, and F-measure for the deceptive 
class are represented by Pp, Rp, Fp, respectively. Precision, recall, and F-measure for the truthful class are 
represented by Pr, Rr, Fr, respectively. 
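
The four measures can be computed from true and predicted labels as in the sketch below. The function names and the toy labels are hypothetical and serve only to show the arithmetic; they are not results from the paper.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F-measure treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure


def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical labels: "D" = deceptive, "T" = truthful.
y_true = ["D", "D", "D", "D", "T", "T", "T", "T"]
y_pred = ["D", "D", "D", "T", "T", "T", "D", "D"]
acc = accuracy(y_true, y_pred)
p_d, r_d, f_d = per_class_metrics(y_true, y_pred, "D")
p_t, r_t, f_t = per_class_metrics(y_true, y_pred, "T")
```

In this toy case the deceptive class has precision 3/5 and recall 3/4, while the truthful class has precision 2/3 and recall 1/2, which shows why the two classes are reported separately.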


3. RESULTS AND DISCUSSION 
3.1. Experimental setting 

The dataset is split into three subsets, training, validation, and test sets, at a ratio of 60:20:20. All of 
the subsets inherited the same characteristics of the original dataset, including class distribution and sentence 
length distribution. Adam optimizer with a learning rate of 0.001 is set for all models. All models are trained 
for 10 epochs. Dropout is used as a regularization technique in all models, with the dropout rate set to 0.3. 
The proposed CNN model applies 32 filters, each of size 3x300. All models are implemented in Python. 
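
A class-preserving 60:20:20 split can be sketched as follows. The function name, the fixed seed, and the rounding rule are assumptions, and the sketch stratifies by class only; the matching of sentence length distributions mentioned above is not reproduced here.

```python
import random


def stratified_split(labels, ratios=(0.6, 0.2, 0.2), seed=0):
    """Return train/validation/test index lists with class proportions kept."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Class counts taken from the dataset description; the labels are placeholders.
labels = ["truthful"] * 1189 + ["deceptive"] * 1189
train, val, test = stratified_split(labels)
```

Under these assumptions the 2,378 messages are divided into 1,426 training, 476 validation, and 476 test instances, with each subset balanced between the two classes.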


3.2. Experimental result 

Table 1 reports the performance of the proposed models using the one-hot vector encoding technique. 
In Table 1, we can see that all the proposed models gave high classification performance. The proposed CNN 
model gave 98.33% accuracy. The proposed BiLSTM model had the best accuracy at 98.74%, and the highest 
recall for detecting deceptive messages at 99.15%. Moreover, BiLSTM provided the highest precision for 
predicting truthful messages at 99.16%. Conversely, the proposed BiGRU model resulted in the lowest 
accuracy of all the models at 97.27%. 

Among the hybrid models, CNN-BiLSTM had the highest accuracy at 98.53%. In addition, we found 
that the combination of CNN and BiGRU (CNN-BiGRU) improved the accuracy when compared to only 
BiGRU, where CNN-BiGRU gave 98.32% accuracy. In conclusion, Table 1 results show that the proposed 
BiLSTM outperforms CNN, BiGRU, CNN-BiLSTM, and CNN-BiGRU when using one-hot vector encoding. 
Figure 7 shows the convergence curve of the loss function for BiLSTM. At the final epoch, the 
training and validation losses are stable and separated by only a small gap, 
indicating a good fit of the model to the data. 


Table 1. The performance of the proposed models with one-hot vector encoding (%) 

                         Deceptive               Truthful
Model        Accuracy    Pp      Rp      Fp      Pr      Rr      Fr
CNN          98.33       97.89   98.72   98.31   98.76   97.94   98.35
BiLSTM       98.74       98.31   99.15   98.73   99.16   98.34   98.75
BiGRU        97.27       96.20   98.28   97.23   98.33   96.31   97.31
CNN-BiLSTM   98.53       98.31   98.73   98.52   98.74   98.33   98.54
CNN-BiGRU    98.32       98.31   98.31   98.31   98.33   98.33   98.33

Figure 7. Convergence curve of the loss function for BiLSTM with one-hot vector encoding 


Table 2 shows the performance of the proposed models tested with the word embedding vector feature 
encoding technique. From Table 2, we can see that all the proposed models gave high classification accuracy, 
especially CNN, BiLSTM, BiGRU, and CNN-BiLSTM. The proposed CNN-BiLSTM model gave the highest 
accuracy, precision, and f-measure of deceptive messages which are 97.90%, 97.15%, and 97.95%, 
respectively. The CNN and BiGRU models gave the highest recall of deceptive messages, while the proposed 
CNN-BiGRU model gave the lowest accuracy at 95.59%. In conclusion, the proposed CNN-BiLSTM had the 
highest accuracy performance in the word embedding vector trial. Figure 8 shows the convergence curve of 
the loss function for CNN-BiLSTM. We can see that training loss and validation loss decrease to a point of 
stability and share a small gap, indicating a good fit. 


Table 2. The performance of the proposed models with word embedding vector encoding (%) 

                         Deceptive               Truthful
Model        Accuracy    Pp      Rp      Fp      Pr      Rr      Fr
CNN          97.17       95.24   99.17   97.17   99.11   94.87   96.94
BiLSTM       97.06       95.97   98.35   97.14   98.25   95.73   96.97
BiGRU        97.69       96.39   99.17   97.76   99.12   96.15   97.61
CNN-BiLSTM   97.90       97.15   98.76   97.95   98.70   97.01   97.84
CNN-BiGRU    95.59       95.85   95.45   95.65   95.32   95.73   95.52

Figure 8. Convergence curve of the loss function for CNN-BiLSTM with word embedding vector encoding 

From Tables 1 and 2, we can conclude that the proposed models gave excellent accuracy 
(95.59% to 98.74%) for detecting deceptive messages. The combination of CNN and BiLSTM 
gave the highest accuracy in the word embedding trial (97.90%). 
Ultimately, the proposed BiLSTM with the one-hot encoding technique gave the best overall classification 
accuracy (98.74%) on the dataset when compared to CNN, BiGRU, CNN-BiLSTM, and 
CNN-BiGRU under all trial conditions. 


4. CONCLUSION 

This paper proposed deep learning models to classify deceptive messages written in the Thai language. 
Five classification models (CNN, BiLSTM, BiGRU, CNN-BiLSTM, and CNN-BiGRU) were proposed and 
evaluated with two different feature encoding techniques: one-hot encoding and word embedding. From the 
experiments, we found that all the proposed models gave excellent accuracy (95.59% to 
98.74%) on the Thai deceptive messages dataset collected from Facebook pages. We attribute the 
high performance of the deep learning models to their ability to extract semantic information and 
to learn highly discriminative features without manual intervention. Each of the proposed models 
gave high accuracy, recall, precision, and F-measure in detecting truthful and deceptive messages. The proposed 
BiLSTM model gave the best accuracy (98.74%) when features were encoded using the one-hot 
encoding technique. In future work, we will consider applying the proposed models to further datasets and running 
trials using part-of-speech tagging and semantic tagging to extract features. 